In [1]:
__author__ = 'Alice Jacques <alice.jacques@noirlab.edu>, NOIRLab Astro Data Lab Team <datalab@noao.edu>' 
__version__ = '20201209' 
__datasets__ = ['ls_dr8','sdss_dr16','gaia_dr2','des_dr1','smash_dr2','unwise_dr1','allwise'] 
__keywords__ = ['crossmatch','joint query','mydb','vospace','image cutout']

How to use the pre-crossmatched tables at Astro Data Lab

by Alice Jacques and the NOIRLab Astro Data Lab Team

Goals

  • Learn how to use a pre-crossmatched table to do a joint query on two Data Lab data sets
  • Learn how to do an efficient crossmatch of a user-provided data table against a Data Lab pre-crossmatched table

Summary

Crossmatch table naming template

The crossmatch tables at Astro Data Lab are structured as followed:

schema1.xNpN__table1__schema2__table2

where the N in NpN encode the numerical value of the crossmatching radius (since dots '.' are not allowed in table names).

Example:

ls_dr8.x1p5__tractor_primary_n__gaia_dr2__gaia_source

is a crossmatch table (indicated by the leading x), located in the ls_dr8 schema, and it crossmatches the ls_dr8.tractor_primary_n table with the gaia_dr2.gaia_source table (which lives in the gaia_dr2 schema) within a 1.5 arcseconds radius ('1p5') .

This is admittedly long, but clean, consistent, and most importantly, parsable. The use of double-underscores '__' is to distinguish from single underscores often used in schema and table names.

Columns in crossmatch tables

All crossmatch tables shall be minimalist, i.e. have only these columns: id1,ra1,dec1,id2,ra2,dec2,distance. Column descriptions in the crossmatch table shall contain the original column names in parentheses (makes it parsable).

For example:

ls_dr8.x1p5__tractor_primary_n__gaia_dr2__gaia_source

Column Description Datatype
id1 ID in left/first table (ls_id) BIGINT
ra1 Right ascension in left/first table (ra) DOUBLE
dec1 Declination in left/first table (dec) DOUBLE
id2 ID in right/second table (source_id) BIGINT
ra2 Right ascension in right/second table (ra) DOUBLE
dec2 Declination in right/second table (dec) DOUBLE
distance Distance between ra1,dec1 and ra2,dec2 (arcsec) DOUBLE

Datatypes in crossmatch tables

  • The column data types in a crossmatch table for columns id1 and id2 shall be retained from the mother tables. The example above, BIGINT, is valid in many cases, but need not be for all data sets.
  • The data types for columns ra1, dec1, ra2, dec2 shall be DOUBLE, which they usually will be anyway.
  • The column distance can be either REAL or DOUBLE.

Overview

  • The following 5 data sets are considered the main reference tables and are crossmatched against all data sets (if there is sky overlap) and when a new data set is ingested:
    • latest gaia_drN.gaia_source
    • latest nsc_drN.object
    • latest unwise_drN.object
    • allwise.source
    • latest sdss_drN.specobj
  • "Crossmatch" means for now "single nearest neighbor" (and this is the current mode at Data Lab).
  • Object tables only, not single epoch measurements or metadata tables.
  • For every crossmatch table with table1 as the left/first table and table2 as the right/second table, there exists a corresponding crossmatch table with table2 as the left/first table and table1 as the right/second table.
    • For example, allwise.x1p5__source__des_dr1__main and des_dr1.x1p5__main__allwise__source.

The list of available crossmatch tables can be viewed on our query interface here under their respective schema.

Disclaimer & attribution

If you use this notebook for your published science, please acknowledge the following:

Imports and setup

In [2]:
# std lib
from getpass import getpass

# 3rd party
from astropy.utils.data import download_file  #import file from URL
from matplotlib.ticker import NullFormatter

# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc
from dl.helpers.utils import convert # converts table to Pandas dataframe object

Authentication

Much of the functionality of Data Lab can be accessed without explicitly logging in (the service then uses an anonymous login). But some capacities, for instance saving the results of your queries to your virtual storage space, require a login (i.e. you will need a registered user account).

If you need to log in to Data Lab, issue this command, and respond according to the instructions:

In [3]:
#ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
ac.whoAmI()
Out[3]:
'demo00'

Accessing the pre-crossmatched tables

We can use Data Lab's Query Client to access the pre-crossmatched tables hosted by Data Lab. First let's get a total count of the number of objects (nrows) in SDSS DR16 that are also in LS DR8:

In [4]:
%%time
query="SELECT nrows FROM tbl_stat WHERE schema='sdss_dr16' and tbl_name='x1p5__specobj__ls_dr8__tractor_primary'"

# Call query manager
response = qc.query(sql=query)
print(response)
nrows
4542857

CPU times: user 27.4 ms, sys: 2.23 ms, total: 29.6 ms
Wall time: 105 ms

Now let's print just the first 100 rows:

In [5]:
query = "SELECT * FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary LIMIT 100"
response = qc.query(sql=query)
convert(response) # convert the table into a Pandas dataframe object
Out[5]:
id1 ra1 dec1 id2 ra2 dec2 distance
0 3384465917919389696 287.22826 48.064735 8797230351783516 287.228165 48.064735 0.000063
1 3384466192797296640 287.44889 48.229698 8797230414957399 287.448870 48.229697 0.000014
2 3384462344506599424 287.38750 48.168965 8797230414890143 287.387517 48.168933 0.000034
3 3384463718896134144 287.69779 48.382804 8797230477803600 287.697861 48.382752 0.000070
4 3384465093285668864 287.54718 48.407654 8797230477804882 287.547174 48.407548 0.000106
... ... ... ... ... ... ... ...
95 3384471690355435520 287.70990 48.888661 8797230602453456 287.709937 48.888637 0.000034
96 3384469491332179968 287.66389 48.944252 8797230602454731 287.663800 48.944491 0.000247
97 3384480486448457728 287.22115 48.827232 8797230540199804 287.221105 48.827183 0.000057
98 3384477737669388288 287.29420 48.927487 8797230602388155 287.294186 48.927487 0.000009
99 3384470590843807744 287.46812 49.027895 8797230602391658 287.468139 49.027900 0.000013

100 rows × 7 columns

Writing a JOIN query

In order to extract only the relevant columns pertaining to our science question from multiple data tables, we may write a query that uses a JOIN statement. There are 4 main types of JOIN statements that we could use, and which one we decide to choose depends on how we want the information to be extracted.

  1. (INNER) JOIN: Returns rows that have matching values in both tables
  2. LEFT (OUTER) JOIN: Returns all rows from the left table, and the matched rows from the right table
  3. RIGHT (OUTER) JOIN: Returns all rows from the right table, and the matched rows from the left table
  4. FULL (OUTER) JOIN: Returns all rows when there is a match in either left or right table

Take a moment to look over the figure below outlining the various JOIN statement types.
NOTE: the default JOIN is an INNER JOIN.

JOIN LATERAL

In nearest neighbor crossmatch queries, we use JOIN LATERAL, which is like a SQL foreach loop that will iterate over each row in a result set and evaluate a subquery using that row as a parameter.

Joint query of LS and SDSS catalogs

Here we will examine spectroscopic redshifts from SDSS DR16 and photometry from LS DR8. The two crossmatch tables related to these two catalogs are: ls_dr8.x1p5__tractor__sdss_dr16__specobj and sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary. The choice of which of these two crossmatch tables to use should be based on the science question being posed. For instance, the question 'how does a galaxy's structure change with redshift?' is dependent on the redshift values obtained from SDSS DR16, so we should use the crossmatch table that has SDSS DR16 as the first table. So, the relevant information we want from our 3 tables of interest for this example are:

  1. "X" = sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary
    • ra1 (ra of sdss object)
    • dec1 (dec of sdss object)
  2. "S" = sdss_dr16.specobj
    • z (redshift)
    • class (spectroscopic class: GALAXY, QSO, or STAR)
    • veldisp (velocity dispersion)
    • veldisperr (error in velocity dispersion)
  3. "L" = ls_dr8.tractor
    • ra (ra of ls object)
    • dec (dec of ls object)
    • type (morphological model^)
    • g_r (computed g-r color)
    • r_z (computed r-z color)

^PSF=stellar, REX=round exponential galaxy, DEV=deVauc, EXP=exponential, COMP=composite, DUP=Gaia source fit by different model.

Write the query

Now that we know what we want and where we want it from, let's write the query and then print the results on screen.

In [6]:
query = ("""SELECT 
           X.ra1 as ra_sdss,X.dec1 as dec_sdss,
           S.z,S.class,S.veldisp,S.veldisperr,
           L.ra as ra_ls,L.dec as dec_ls,L.type,L.g_r,L.r_z
         FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary as X 
         JOIN sdss_dr16.specobj as S ON X.id1 = S.specobjid 
         JOIN ls_dr8.tractor AS L ON X.id2 = L.ls_id
         WHERE X.ra1 BETWEEN %s and %s and X.dec1 BETWEEN %s and %s
         """) %(110,200,7.,40.)  #large region
print(query)
SELECT 
           X.ra1 as ra_sdss,X.dec1 as dec_sdss,
           S.z,S.class,S.veldisp,S.veldisperr,
           L.ra as ra_ls,L.dec as dec_ls,L.type,L.g_r,L.r_z
         FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary as X 
         JOIN sdss_dr16.specobj as S ON X.id1 = S.specobjid 
         JOIN ls_dr8.tractor AS L ON X.id2 = L.ls_id
         WHERE X.ra1 BETWEEN 110 and 200 and X.dec1 BETWEEN 7.0 and 40.0
         
In [7]:
%%time
response = qc.query(sql=query) # default format is a CSV file
CPU times: user 1.56 s, sys: 520 ms, total: 2.08 s
Wall time: 13.5 s
In [8]:
result = convert(response) # convert to a Pandas dataframe object
result
Out[8]:
ra_sdss dec_sdss z class veldisp veldisperr ra_ls dec_ls type g_r r_z
0 123.12650 39.993317 0.124905 GALAXY 221.7120 10.6741 123.126449 39.993302 DEV 1.153230 0.772179
1 123.36125 39.996015 -0.000118 STAR 0.0000 0.0000 123.361280 39.995986 PSF 1.564990 1.792350
2 123.23940 39.990355 0.067814 GALAXY 80.2651 13.9739 123.239365 39.990334 EXP 0.747831 0.448318
3 123.18048 39.935782 0.067945 GALAXY 102.3340 14.9775 123.180441 39.935786 EXP 0.899620 0.669617
4 123.25379 39.959822 0.000056 STAR 0.0000 0.0000 123.253686 39.959713 PSF 1.730960 2.775590
... ... ... ... ... ... ... ... ... ... ... ...
1122603 146.11068 39.875726 0.610066 GALAXY 187.2020 40.1035 146.110701 39.875758 DEV 1.885480 1.482170
1122604 146.28698 39.864958 0.483563 GALAXY 211.4210 58.9968 146.286971 39.864966 DEV 1.926210 1.134800
1122605 146.15299 39.836558 3.306930 QSO 0.0000 0.0000 146.152998 39.836521 PSF 1.172630 0.647699
1122606 146.00503 39.873979 3.116330 QSO 0.0000 0.0000 146.005041 39.873990 PSF 0.503344 0.147423
1122607 146.02245 39.872488 0.108817 GALAXY 79.0409 12.3372 146.022464 39.872467 COMP 0.785284 0.561127

1122608 rows × 11 columns

Saving results to VOSpace

VOSpace is a convenient storage space for users to save their work. It can store any data or file type. We can save the results from the same query to our virtual storage space:

In [9]:
response = qc.query(sql=query, fmt='csv', out='vos://testresult.csv')

Let's ensure the file was saved in VOSpace:

In [10]:
sc.ls(name='vos://testresult.csv')
Out[10]:
'testresult.csv'

Now let's remove the file we just saved to VOSpace:

In [11]:
sc.rm (name='vos://testresult.csv')
Out[11]:
'OK'

Let's ensure the file was removed from VOSpace:

In [12]:
sc.rm (name='vos://testresult.csv')
Out[12]:
'A Node does not exist with the requested URI.'

Saving results to MyDB

MyDB is a useful OS remote per-user relational database that can store data tables. Furthermore, the results of queries can be directly saved to MyDB, as we show in the following example:

In [13]:
response = qc.query(sql=query, fmt='csv', out='mydb://testresult')

Ensure the table has been saved to MyDB by calling the mydb_list() function, which will list all tables currently in a user's MyDB:

In [14]:
print(qc.mydb_list(),"\n")
asv_string1,
bsv_string1,
csv_string1,
csv_vizier,
delme,
foo2,
gaia_sample,created:2020-12-09 13:39:42 MST
gaia_sample_xmatch,created:2020-12-09 13:39:44 MST
gals,created:2020-11-12 15:37:38 MST
gals2,created:2020-11-12 15:30:15 MST
galspec_ls,
hygdata_v3,created:2020-11-06 15:04:35 MST
import1,
mags,
mags3,
objects,
pandas1,
star_clusters,created:2020-11-12 15:47:04 MST
test1a,
test2,
test3,
testresult,created:2020-12-09 14:00:56 MST
timeseries,created:2020-10-01 15:47:15 MST
tsv_string1,
usno_test,
zxz,
 

Now let's drop the table from our MyDB.

In [15]:
qc.mydb_drop('testresult')
Out[15]:
'OK'

Ensure it has been removed by calling the mydb_list() function again:

In [16]:
print(qc.mydb_list(),"\n")
asv_string1,
bsv_string1,
csv_string1,
csv_vizier,
delme,
foo2,
gaia_sample,created:2020-12-09 13:39:42 MST
gaia_sample_xmatch,created:2020-12-09 13:39:44 MST
gals,created:2020-11-12 15:37:38 MST
gals2,created:2020-11-12 15:30:15 MST
galspec_ls,
hygdata_v3,created:2020-11-06 15:04:35 MST
import1,
mags,
mags3,
objects,
pandas1,
star_clusters,created:2020-11-12 15:47:04 MST
test1a,
test2,
test3,
timeseries,created:2020-10-01 15:47:15 MST
tsv_string1,
usno_test,
zxz,
 

Crossmatch a user-provided data table and a pre-crossmatched table

We can construct a query to run a crossmatch in the database using the q3c_join() function, which identifies all matching objects within a specified radius in degrees (see details on using Q3C functions). For this example, we will search only for the single nearest neighbor. For different examples of crossmatching, see our How to crossmatch tables notebook.

First, let's query a small selection of sample data from the Data Lab database and store it in MyDB as gaia_sample. This will act as our "user-provided table".

In [17]:
%%time
query = """SELECT source_id,ra,dec,parallax,pmra,pmdec FROM gaia_dr2.gaia_source 
WHERE ra<200 AND ra>124 AND random_id<10 LIMIT 70000"""
response = qc.query(sql=query)
convert(response)
CPU times: user 211 ms, sys: 41.1 ms, total: 252 ms
Wall time: 3.57 s
Out[17]:
source_id ra dec parallax pmra pmdec
0 6063068343808472704 197.399687 -58.090203 -0.098219 -5.878370 0.669591
1 5311388830071672064 136.379515 -53.552073 -0.350885 -2.412417 4.913027
2 5253273391401802880 152.682263 -61.906258 1.284080 -5.050487 0.935614
3 1150976404114388992 126.455169 85.364199 NaN NaN NaN
4 5336579809692920192 171.540877 -58.850094 0.066621 -5.200648 2.182430
... ... ... ... ... ... ...
69995 3913618287955193088 176.969466 11.788763 NaN NaN NaN
69996 5390277174450667520 165.908202 -41.241788 0.178047 -5.586176 1.011907
69997 6147699639264396160 180.013010 -44.063650 0.124505 0.799060 -0.494301
69998 5254280024328426496 160.335470 -59.995110 0.220531 -4.228320 2.716516
69999 5340612788286304256 165.849994 -56.655694 0.364628 -6.267243 3.178044

70000 rows × 6 columns

In [18]:
qc.query(sql=query,out='mydb://gaia_sample',drop=True)
Out[18]:
'OK'

Write a crossmatch query

Next let's crossmatch our gaia_sample table with Data Lab's pre-crossmatched table between SMASH DR2 and allWISE smash_dr2.x1p5__object__allwise__source. We'll write our crossmatch query using the q3c_join() function as well as the q3c_dist() function, searching for the nearest neighbor within a 1.5 arcsec radius (which must be converted into degrees for the query, so we divide by 3600.0). We will then save it in MyDB as gaia_sample_xmatch.

In [19]:
%%time
qu = """SELECT
        G.source_id,ss.id1,ss.id2,G.ra,G.dec,ss.ra1,ss.dec1,ss.ra2,ss.dec2,
        (q3c_dist(G.ra,G.dec,ss.ra1,ss.dec1)*3600.0) as dist_arcsec
        FROM mydb://gaia_sample AS G
        JOIN LATERAL (
            SELECT S.id1,S.id2,S.ra1,S.dec1,S.ra2,S.dec2
            FROM 
                smash_dr2.x1p5__object__allwise__source AS S
            WHERE 
                q3c_join(G.ra,G.dec,S.ra1,S.dec1, 1.5/3600.0)
            ORDER BY
                q3c_dist(G.ra,G.dec,S.ra1,S.dec1)
            ASC LIMIT 1
            ) AS ss ON true;
    """
resp = qc.query(sql=qu)
convert(resp)
CPU times: user 35.6 ms, sys: 2.87 ms, total: 38.5 ms
Wall time: 752 ms
Out[19]:
source_id id1 id2 ra dec ra1 dec1 ra2 dec2 dist_arcsec
0 5205696522501120512 Field80.880606 1482074301351023387 150.869988 -74.496170 150.869990 -74.496177 150.869098 -74.496177 0.023617
1 5205674807146027008 Field80.61312 1535074301351011624 151.032199 -74.745129 151.032198 -74.745131 151.032752 -74.745275 0.008880
2 6141410299609363840 Field127.835819 1993039401351059307 199.455692 -38.742563 199.455696 -38.742564 199.455683 -38.742551 0.010309
3 5467823633613278720 Field85.328869 1577028801351061337 157.482148 -28.132586 157.482144 -28.132586 157.482181 -28.132587 0.013867
4 5659712743651975424 Field76.497391 1486024301351024709 149.263676 -24.547096 149.263677 -24.547099 149.263713 -24.547078 0.011663
... ... ... ... ... ... ... ... ... ... ...
637 5388501192592385536 Field91.325126 1638042501351015781 163.258046 -42.610578 163.258049 -42.610577 163.258049 -42.610625 0.009637
638 5459798298242774016 Field77.1011912 1493033401351048226 149.796546 -33.028633 149.796547 -33.028635 149.796653 -33.028629 0.005438
639 6152131053375898496 Field123.257555 1918040901351043838 191.736605 -40.442602 191.736609 -40.442599 191.736650 -40.442600 0.016839
640 5441173017945297920 Field163.329494 1591037901351024901 159.794886 -38.270692 159.794889 -38.270695 159.794912 -38.270608 0.012900
641 5390277174450667520 Field165.440845 1661040901351015490 165.908202 -41.241788 165.908207 -41.241784 165.908317 -41.241735 0.017132

642 rows × 10 columns

In [20]:
qc.query(sql=qu,out='mydb://gaia_sample_xmatch',drop=True)
Out[20]:
'OK'

Write the joint query

Now we can write a query using the JOIN statement in order to extract the columns we want from our tables of interest. Just as in the previous section, let's first make an outline of which tables we'd like to extract columns from.

  1. "X" = mydb://gaia_sample_xmatch
    • source_id (source id from gaia dr2)
    • id1 (source id from smash dr1)
    • id2 (source id from allwise)
    • ra (ra value from gaia dr2)
    • dec (dec value from gaia dr2)
  2. "s" = smash_dr2.object
    • gmag (weighted-avarage, calibrated g-band magnitude, 99.99 if no detection)
    • rmag (weighted-avarage, calibrated r-band magnitude, 99.99 if no detection)
    • zmag (weighted-avarege, calibrated z-band magnitude, 99.99 if no detection)
  3. "a" = allwise.source
    • w1mpro (W1 magnitude measured with profile-fitting photometry)
    • w2mpro (W2 magnitude measured with profile-fitting photometry)
    • w3mpro (W3 magnitude measured with profile-fitting photometry)
  4. "g" = mydb://gaia_sample
    • parallax
    • pmra (proper motion in right ascension direction)
    • pmdec (proper motion in declination direction)
In [21]:
query = ("""SELECT 
           X.source_id,X.id1,X.id2,X.ra,X.dec,
           s.gmag,s.rmag,s.zmag,
           a.w1mpro,a.w2mpro,a.w3mpro,
           g.parallax,g.pmra,g.pmdec
         FROM mydb://gaia_sample_xmatch as X 
         JOIN smash_dr2.object as s ON X.id1 = s.id 
         JOIN allwise.source AS a ON X.id2 = a.cntr
         JOIN mydb://gaia_sample AS g ON X.source_id = g.source_id
         """)
print(query)
SELECT 
           X.source_id,X.id1,X.id2,X.ra,X.dec,
           s.gmag,s.rmag,s.zmag,
           a.w1mpro,a.w2mpro,a.w3mpro,
           g.parallax,g.pmra,g.pmdec
         FROM mydb://gaia_sample_xmatch as X 
         JOIN smash_dr2.object as s ON X.id1 = s.id 
         JOIN allwise.source AS a ON X.id2 = a.cntr
         JOIN mydb://gaia_sample AS g ON X.source_id = g.source_id
         
In [22]:
response = qc.query(sql=query) # default format is a CSV file
convert(response)
Out[22]:
source_id id1 id2 ra dec gmag rmag zmag w1mpro w2mpro w3mpro parallax pmra pmdec
0 6131589629253634304 Field168.1156764 1853045501351008859 185.012314 -45.934668 19.4154 18.3814 17.7553 15.931 16.042 13.054 0.158196 -9.417819 -4.594620
1 5441239847637128832 Field163.1279533 1587039401351055638 158.835349 -38.680901 20.3020 99.9900 99.9900 15.739 15.543 12.188 1.568020 -13.376750 0.883958
2 5384595418049695616 Field104.1030623 1741039401351048892 174.413163 -39.098332 20.3235 19.4183 18.8627 17.403 17.325 12.088 0.878178 -2.315444 -1.981345
3 5199100724044211456 Field87.1011857 1656078801351008164 164.270068 -79.405031 19.8656 18.3270 17.0464 14.755 14.799 12.924 1.008309 -7.250559 3.201105
4 5467597104154887040 Field85.1028867 1577028801351048828 158.070415 -28.561233 20.3464 19.3088 18.7218 17.126 16.309 12.468 0.482538 -3.713790 0.281505
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
107 5198757405832808576 Field87.1204921 1590080301351016502 156.965653 -80.528784 21.9378 99.9900 99.9900 16.081 15.829 12.932 1.626588 2.234972 -16.577670
108 5381961366150238080 Field166.1703433 1727044001351026575 173.588279 -43.889225 17.9590 99.9900 99.9900 14.997 14.988 12.565 0.604291 -7.352319 -5.747996
109 5198195207497561088 Field106.1189889 1674080301351024353 170.834267 -80.526241 18.5196 17.9265 17.5419 16.117 16.245 12.933 0.259513 -4.197899 -1.071158
110 5458462327193883392 Field161.1718085 1516034901351031680 151.645492 -34.804156 18.4303 99.9900 99.9900 15.587 15.786 12.449 0.353570 -7.475217 5.853638
111 5459798298242774016 Field77.1011912 1493033401351048226 149.796546 -33.028633 22.3068 20.7488 18.7407 16.427 15.961 12.290 0.869519 -0.222829 2.052986

112 rows × 14 columns

Speed test

Here we compare the speed of using the q3c_join() function to crossmatch directly in a JOIN query (query1) versus using a pre-crossmatched table in a JOIN query (query2). We select objects from the two catalogs and retrieve the same specified columns for the two queries. We will see that query1 times out after 300 seconds (5 minutes) and fails to retrieve results, while query2 runs for about 60-90 seconds (1-1.5 minutes) and will retrieve the 3.6 million rows we queried for.

In [23]:
%%time
query1 = """SELECT
           a.cntr as id1,a.ra as ra1,a.dec as dec1,a.pmdec,a.pmra,a.w1mpro,a.w2mpro,
           gg.specobjid as id2,gg.ra as ra2,gg.dec as dec2,gg.z,gg.class,gg.veldisp,gg.veldisperr,
           (q3c_dist(a.ra,a.dec,gg.ra,gg.dec)*3600.0) as dist_arcsec 
         FROM 
            allwise.source AS a
         INNER JOIN LATERAL (
            SELECT s.specobjid,s.ra,s.dec,s.z,s.class,s.veldisp,s.veldisperr
            FROM 
                sdss_dr16.specobj AS s
            WHERE
                q3c_join(a.ra, a.dec, s.ra, s.dec, 1.5/3600.0)
            ORDER BY
                random()
            ASC LIMIT 1
        ) as gg ON true
"""
resp1 = qc.query(sql=query1,timeout=300)
convert(resp1)
---------------------------------------------------------------------------
queryClientError                          Traceback (most recent call last)
<timed exec> in <module>

/data0/sw/anaconda3/lib/python3.7/site-packages/dl/Util.py in __call__(self, *args, **kw)
     80             return function(self.obj, *args, **kw)
     81         else:
---> 82             return function(*args, **kw)
     83 
     84     def __repr__(self):

/data0/sw/anaconda3/lib/python3.7/site-packages/dl/queryClient.py in query(token, adql, sql, fmt, out, async_, drop, profile, **kw)
    541     return qc_client._query (token=def_token(token), adql=adql, sql=sql, 
    542                              fmt=fmt, out=out, async_=async_, drop=drop, profile=profile,
--> 543                              **kw)
    544 
    545 

/data0/sw/anaconda3/lib/python3.7/site-packages/dl/queryClient.py in _query(self, token, adql, sql, fmt, out, async_, drop, profile, **kw)
   2022         r = requests.get (dburl, headers=headers, timeout=timeout)
   2023         if r.status_code != 200:
-> 2024             raise queryClientError (r.text)
   2025         resp = qcToString(r.content)
   2026 

queryClientError: Error: QM: Query timeout at 300 sec
In [24]:
%%time
query2 = """SELECT 
           X.id1,X.id2,X.ra1,X.dec1,X.ra2,X.dec2,X.distance as dist_arcsec,
           a.pmdec,a.pmra,a.w1mpro,a.w2mpro,
           s.z,s.class,s.veldisp,s.veldisperr
         FROM 
             allwise.x1p5__source__sdss_dr16__specobj as X 
         JOIN 
             allwise.source as a ON X.id1 = a.cntr 
         JOIN 
             sdss_dr16.specobj AS s ON X.id2 = s.specobjid
         """
resp2 = qc.query(sql=query2)
convert(resp2)
CPU times: user 13.9 s, sys: 4.06 s, total: 17.9 s
Wall time: 1min 1s
Out[24]:
id1 id2 ra1 dec1 ra2 dec2 dist_arcsec pmdec pmra w1mpro w2mpro z class veldisp veldisperr
0 1601351000041 4902396990288318464 0.609037 -2.119844 0.609022 -2.119838 0.000016 -104.0 373.0 13.899 13.602 0.198010 GALAXY 290.2170 12.5112
1 1601351000062 4902396440532504576 0.608449 -2.010911 0.608431 -2.010910 0.000017 103.0 226.0 14.423 14.137 0.285261 GALAXY 250.9350 15.4101
2 1601351000070 4902398089799946240 0.560765 -2.193133 0.560709 -2.193130 0.000057 355.0 524.0 14.688 14.423 0.279780 GALAXY 263.8380 17.4279
3 1601351000092 4902395890776690688 0.647265 -2.032009 0.647439 -2.031991 0.000175 -150.0 -562.0 15.051 14.778 0.383525 GALAXY 179.6240 22.3847
4 1601351000100 4902398914433667072 0.695043 -2.077576 0.695027 -2.077596 0.000026 -84.0 -462.0 15.104 14.949 0.441733 GALAXY 275.9000 27.6191
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3620144 112101351002760 6925422251849568256 0.165354 11.445586 0.165377 11.445490 0.000099 -465.0 -64.0 15.321 15.206 0.416764 GALAXY 312.3530 49.0224
3620145 112101351002765 6960521410528956416 0.385179 11.403072 0.385121 11.403295 0.000230 482.0 -270.0 15.308 15.253 0.631752 GALAXY 261.1230 46.3088
3620146 112101351002771 6960503268587098112 0.176655 11.435352 0.176650 11.435383 0.000031 71.0 -84.0 15.437 15.210 0.539123 GALAXY 240.7100 31.3019
3620147 112101351002790 -5446833951671562240 0.401569 11.521001 0.401618 11.520917 0.000097 94.0 200.0 15.374 15.120 0.896272 GALAXY 282.4040 69.9966
3620148 112101351002794 -5446835600939003904 0.335864 11.490887 0.335873 11.490872 0.000017 -873.0 962.0 15.528 15.458 0.731471 GALAXY 18.0887 87.3522

3620149 rows × 15 columns

For completeness, we switch the order of the queries and query from a different catalog.

We again select objects from two catalogs and retrieve the same specified columns for two queries. query3 uses a pre-crossmatched table in a JOIN query and query4 crossmatches directly in the JOIN query. We will see that query3 runs for about 60-90 seconds (1-1.5 minutes) and will retrieve the 4.4 million rows we queried for, while query4 times out after 300 seconds (5 minutes) and fails to retrieve results.

In [25]:
%%time
query3 = """SELECT 
           X.id1,X.id2,X.ra1,X.dec1,X.ra2,X.dec2,X.distance as dist_arcsec,
           u.mag_w1_vg,u.mag_w2_vg,s.z,s.class,s.veldisp,s.veldisperr
         FROM 
             unwise_dr1.x1p5__object__sdss_dr16__specobj as X 
         JOIN 
             unwise_dr1.object as u ON X.id1 = u.unwise_objid 
         JOIN 
             sdss_dr16.specobj AS s ON X.id2 = s.specobjid
         ORDER BY 
             random()
         """
resp3 = qc.query(sql=query3,timeout=300)
convert(resp3)
CPU times: user 18.2 s, sys: 4.95 s, total: 23.1 s
Wall time: 1min 25s
Out[25]:
id1 id2 ra1 dec1 ra2 dec2 dist_arcsec mag_w1_vg mag_w2_vg z class veldisp veldisperr
0 0468m016o0020412 1703513559851362304 46.255816 -0.903448 46.255831 -0.903048 0.000400 12.3448 12.2574 -0.000048 STAR 0.000 0.00000
1 1715p090o0022354 6047310754495090688 170.900051 8.803995 170.900050 8.803976 0.000019 15.0182 14.8719 0.402605 GALAXY 172.040 21.13650
2 1557p000o0041218 4313588992803360768 156.071266 0.297108 156.071550 0.297323 0.000356 17.6309 inf 1.293800 QSO 0.000 0.00000
3 3539p000o0003859 -8577009709786836992 353.578689 -0.520125 353.578720 -0.520124 0.000031 15.6772 15.6449 -0.000616 STAR 0.000 0.00000
4 3327p015o0010854 -8125554071448563712 333.253925 1.358684 333.253890 1.358663 0.000040 15.7161 14.9163 1.002990 QSO 0.000 0.00000
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4374677 0136m046o0017016 7929876981453443072 14.242345 -4.048496 14.242420 -4.048415 0.000110 14.0831 13.8167 0.147089 GALAXY 227.048 9.95283
4374678 2443p469o0023939 702672632423671808 244.594477 47.238421 244.594510 47.238317 0.000107 14.3527 14.0577 0.151071 GALAXY 124.334 11.04230
4374679 1402p272o0001087 2349897748653828096 139.615330 26.542583 139.615220 26.542517 0.000118 14.1056 13.7175 0.216664 GALAXY 236.511 13.31570
4374680 1178p469o0007526 4144547596214751232 118.558808 46.570576 118.558260 46.570631 0.000381 15.7396 15.5694 -0.000075 STAR 0.000 0.00000
4374681 2232p166o0009142 3112084725376772096 222.700145 16.367547 222.700190 16.367545 0.000043 13.7144 13.7089 0.110793 GALAXY 117.573 15.58460

4374682 rows × 13 columns

In [26]:
%%time
query4 = """SELECT
           u.unwise_objid as id1,u.ra as ra1,u.dec as dec1,u.mag_w1_vg,u.mag_w2_vg,
           ss.specobjid as id2,ss.ra as ra2,ss.dec as dec2,ss.z,ss.class,ss.veldisp,ss.veldisperr,
           (q3c_dist(u.ra,u.dec,ss.ra,ss.dec)*3600.0) as dist_arcsec 
         FROM 
            unwise_dr1.object AS u
         INNER JOIN LATERAL (
            SELECT s.specobjid,s.ra,s.dec,s.z,s.class,s.veldisp,s.veldisperr
            FROM 
                sdss_dr16.specobj AS s
            WHERE
                q3c_join(u.ra, u.dec, s.ra, s.dec, 1.5/3600.0)
            ORDER BY
                random()
            ASC LIMIT 1
        ) as ss ON true
"""
resp4 = qc.query(sql=query4,timeout=300)
convert(resp4)
---------------------------------------------------------------------------
queryClientError                          Traceback (most recent call last)
<timed exec> in <module>

/data0/sw/anaconda3/lib/python3.7/site-packages/dl/Util.py in __call__(self, *args, **kw)
     80             return function(self.obj, *args, **kw)
     81         else:
---> 82             return function(*args, **kw)
     83 
     84     def __repr__(self):

/data0/sw/anaconda3/lib/python3.7/site-packages/dl/queryClient.py in query(token, adql, sql, fmt, out, async_, drop, profile, **kw)
    541     return qc_client._query (token=def_token(token), adql=adql, sql=sql, 
    542                              fmt=fmt, out=out, async_=async_, drop=drop, profile=profile,
--> 543                              **kw)
    544 
    545 

/data0/sw/anaconda3/lib/python3.7/site-packages/dl/queryClient.py in _query(self, token, adql, sql, fmt, out, async_, drop, profile, **kw)
   2022         r = requests.get (dburl, headers=headers, timeout=timeout)
   2023         if r.status_code != 200:
-> 2024             raise queryClientError (r.text)
   2025         resp = qcToString(r.content)
   2026 

queryClientError: Error: QM: Query timeout at 300 sec

Appendix

A1. unWISE DR1 vs LS DR8

Here we will compare two images of the same object from two different catalogs, unWISE DR1 and LS DR8.

Function to retrieve cutouts

In [27]:
def make_cutout_comparison_table(ra_in1, dec_in1, ra_in2, dec_in2):
    """
    Obtain unWISE DR1 and LS DR8 color JPEG images from Legacy Survey team cutout tool at NERSC
    """    
    un_img = []
    ls_img = []
    
    for i in range(len(ra_in1)):
        un_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=unwise-neo6&pixscale=0.3' % (ra_in1[i],dec_in1[i])
        img = plt.imread(download_file(un_cutout_url,cache=True,show_progress=False,timeout=120))
        un_img.append(img)
        
        ls_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=ls-dr8&pixscale=0.3' % (ra_in2[i],dec_in2[i])
        img = plt.imread(download_file(ls_cutout_url,cache=True,show_progress=False,timeout=120))
        ls_img.append(img)

    return un_img,ls_img

Function to generate plots

In [28]:
def plot_cutouts(img1,img2,cat1,cat2):
    """
    Plot images in two rows with 5 images in each row
    """
    fig = plt.figure(figsize=(21,7))

    for i in range(len(img1)):
        ax = fig.add_subplot(2,6,i+1)
        ax.imshow(img1[i])
        ax.xaxis.set_major_formatter(NullFormatter())
        ax.yaxis.set_major_formatter(NullFormatter())
        ax.tick_params(axis='both',which='both',length=0)
        ax.text(0.02,0.93,'ra=%.5f'%list_ra1[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.85,'dec=%.5f'%list_dec1[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.77,cat1,transform=ax.transAxes,fontsize=12,color='white')

        ax = fig.add_subplot(2,6,i+7)
        ax.imshow(img2[i])
        ax.xaxis.set_major_formatter(NullFormatter())
        ax.yaxis.set_major_formatter(NullFormatter())
        ax.tick_params(axis='both',which='both',length=0)
        ax.text(0.02,0.93,'ra=%.5f'%list_ra2[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.85,'dec=%.5f'%list_dec2[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.77,cat2,transform=ax.transAxes,fontsize=12,color='white')

    plt.subplots_adjust(wspace=0.02, hspace=0.03)

Write query to randomly select 5 RA/dec targets from unWISE and LS DR8 crossmatch table

... then save them as arrays.

In [29]:
%%time
q = """SELECT ra1,dec1,ra2,dec2 
FROM unwise_dr1.x1p5__object__ls_dr8__tractor_primary 
WHERE ra1>300 AND dec1>33 ORDER BY random() LIMIT 5"""

r = qc.query(sql=q,fmt='pandas')

list_ra1=r['ra1'].values       # ".values" convert to numpy array
list_dec1=r['dec1'].values
list_ra2=r['ra2'].values       
list_dec2=r['dec2'].values
CPU times: user 30.7 ms, sys: 1.25 ms, total: 31.9 ms
Wall time: 36.2 s
In [30]:
cat1='unWISE'
cat2='ls dr8'
un_img,ls_img = make_cutout_comparison_table(list_ra1,list_dec1,list_ra2,list_dec2)
plot_cutouts(un_img,ls_img,cat1,cat2)

A2. SDSS vs DES DR1

Here we will compare two images of the same object from two different catalogs, SDSS and DES DR1.

In [31]:
def make_cutout_comparison_table2(ra_in1, dec_in1, ra_in2, dec_in2):
    """
    Obtain SDSS and DES DR1 color JPEG images from Legacy Survey team cutout tool at NERSC
    """    
    sd_img = []
    de_img = []
    
    for i in range(len(ra_in1)):
        sd_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=sdss&pixscale=0.25' % (ra_in1[i],dec_in1[i])
        img = plt.imread(download_file(sd_cutout_url,cache=True,show_progress=False,timeout=120))
        sd_img.append(img)
        
        de_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=des-dr1&pixscale=0.25' % (ra_in2[i],dec_in2[i])
        img = plt.imread(download_file(de_cutout_url,cache=True,show_progress=False,timeout=120))
        de_img.append(img)

    return sd_img,de_img

Write query to randomly select 5 RA/dec targets from SDSS DR16 and DES DR1 crossmatch table

... then save them as arrays.

In [32]:
%%time
q = """SELECT ra1,dec1,ra2,dec2 
FROM sdss_dr16.x1p5__specobj__des_dr1__main 
ORDER BY random() LIMIT 5"""

r = qc.query(sql=q,fmt='pandas')

list_ra1=r['ra1'].values       # ".values" convert to numpy array
list_dec1=r['dec1'].values
list_ra2=r['ra2'].values       
list_dec2=r['dec2'].values
CPU times: user 28.6 ms, sys: 0 ns, total: 28.6 ms
Wall time: 199 ms
In [33]:
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table2(list_ra1,list_dec1,list_ra2,list_dec2)
plot_cutouts(sd_img,de_img,cat1,cat2)

A3. Cool galaxy finds: SDSS vs DES DR1

In [34]:
qc.mydb_import('gals','./gals.csv',drop=True)
Out[34]:
'OK'
In [35]:
def make_cutout_comparison_table3(ra_in, dec_in):
    """
    Obtain SDSS and DES DR1 color JPEG images from Legacy Survey team cutout tool at NERSC
    """    
    sd_img = []
    de_img = []
    
    for i in range(len(ra_in)):
        sd_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=sdss&pixscale=0.5' % (ra_in[i],dec_in[i])
        img = plt.imread(download_file(sd_cutout_url,cache=True,show_progress=False,timeout=120))
        sd_img.append(img)
        
        de_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=des-dr1&pixscale=0.5' % (ra_in[i],dec_in[i])
        img = plt.imread(download_file(de_cutout_url,cache=True,show_progress=False,timeout=120))
        de_img.append(img)

    return sd_img,de_img
In [36]:
def plot_cutouts2(img1,img2,cat1,cat2):
    """
    Plot images in two rows with 5 images in each row
    """
    fig = plt.figure(figsize=(21,7))

    for i in range(len(img1)):
        ax = fig.add_subplot(2,6,i+1)
        ax.imshow(img1[i])
        ax.xaxis.set_major_formatter(NullFormatter())
        ax.yaxis.set_major_formatter(NullFormatter())
        ax.tick_params(axis='both',which='both',length=0)
        ax.text(0.02,0.93,'ra=%.5f'%list_ra[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.85,'dec=%.5f'%list_dec[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.77,cat1,transform=ax.transAxes,fontsize=12,color='white')

        ax = fig.add_subplot(2,6,i+7)
        ax.imshow(img2[i])
        ax.xaxis.set_major_formatter(NullFormatter())
        ax.yaxis.set_major_formatter(NullFormatter())
        ax.tick_params(axis='both',which='both',length=0)
        ax.text(0.02,0.93,'ra=%.5f'%list_ra[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.85,'dec=%.5f'%list_dec[i],transform=ax.transAxes,fontsize=12,color='white')
        ax.text(0.02,0.77,cat2,transform=ax.transAxes,fontsize=12,color='white')

    plt.subplots_adjust(wspace=0.02, hspace=0.03)
In [37]:
qg = """SELECT ra,dec FROM mydb://gals LIMIT 5"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra=rp['ra'].values       # ".values" convert to numpy array
list_dec=rp['dec'].values
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table3(list_ra,list_dec)
plot_cutouts2(sd_img,de_img,cat1,cat2)
In [38]:
qg = """SELECT ra,dec FROM mydb://gals LIMIT 5 OFFSET 5"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra=rp['ra'].values       # ".values" convert to numpy array
list_dec=rp['dec'].values
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table3(list_ra,list_dec)
plot_cutouts2(sd_img,de_img,cat1,cat2)
In [39]:
qg = """SELECT ra,dec FROM mydb://gals LIMIT 5 OFFSET 10"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra=rp['ra'].values       # ".values" convert to numpy array
list_dec=rp['dec'].values
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table3(list_ra,list_dec)
plot_cutouts2(sd_img,de_img,cat1,cat2)

Resources & references